{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Adding a New Model Architecture to MLC-LLM with the TVM nn.Module Workflow\n",
        "\n",
        "This tutorial demonstrates how to add a new model architecture to MLC-LLM using the new TVM nn.Module workflow. TVM nn.Module is the new model-compilation workflow, designed to bring modular, Python-first compilation to MLC-LLM so that users and developers can support new models and features more seamlessly.\n",
        "\n",
        "For example, under the TVM nn.Module workflow, defining the Mistral model architecture takes roughly half the code it did under the old workflow. At a high level, the TVM nn.Module interface closely resembles PyTorch's nn.Module.\n",
        "\n",
        "We will use [GPT-2](https://huggingface.co/gpt2) for the demonstration. GPT-2 is a transformer model pretrained on a very large corpus of English text in a self-supervised fashion, and it can be used to guess the next word in a sentence. Its [model definition](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py) can be found in Huggingface."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IAp_SpW6sN2L"
      },
      "source": [
        "## Defining the GPT-2 Model\n",
        "\n",
        "Create a `gpt2` folder under `mlc-llm/python/mlc_llm/model/`. Its structure will look like this:\n",
        "\n",
        "```\n",
        "mlc-llm/python/mlc_llm/model/gpt2/\n",
        "├── gpt2_loader.py          # Loads and converts weights from Huggingface\n",
        "├── gpt2_model.py           # Defines the model architecture and configuration\n",
        "├── gpt2_quantization.py    # Defines the quantization schemes\n",
        "└── __init__.py\n",
        "```\n",
        "\n",
        "We first focus on `gpt2_model.py`. This file defines the GPT-2 model architecture in a modular way using `tvm.relax.frontend.nn.Module`, similar to its PyTorch counterpart."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "from set_env import temp_dir"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_-O_-R7rsN4p"
      },
      "source": [
        "### Defining the Configuration Class in `gpt2_model.py`\n",
        "\n",
        "First, we define the configuration class, which is almost a direct translation of Huggingface's [GPT2Config](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/configuration_gpt2.py). The attributes of this class must share the names of the corresponding fields in the Huggingface configuration; otherwise the Huggingface config will not load correctly.\n",
        "\n",
        "The `__post_init__` method is called after all dataclass attributes have been initialized."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "z8xwX363wsqx",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[0;31mInit signature:\u001b[0m\n",
            "\u001b[0mGPT2Config\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mvocab_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_embd\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_layer\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_head\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mlayer_norm_epsilon\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_inner\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mcontext_window_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mprefill_chunk_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mscale_attn_by_inverse_layer_idx\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mtensor_parallel_shards\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mhead_dim\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mmax_batch_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mkwargs\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mDict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mAny\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m<\u001b[0m\u001b[0mfactory\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mSource:\u001b[0m        \n",
            "\u001b[0;34m@\u001b[0m\u001b[0mdataclasses\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdataclass\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;32mclass\u001b[0m \u001b[0mGPT2Config\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mConfigBase\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m  \u001b[0;31m# pylint: disable=too-many-instance-attributes\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0;34m\"\"\"Configuration of the GPT-2 model.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mvocab_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_embd\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_layer\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_head\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mlayer_norm_epsilon\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mn_inner\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mcontext_window_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mprefill_chunk_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mscale_attn_by_inverse_layer_idx\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mtensor_parallel_shards\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mhead_dim\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mmax_batch_size\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0mkwargs\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mDict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mAny\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdataclasses\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfield\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdefault_factory\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0;32mdef\u001b[0m \u001b[0m__post_init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_inner\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_inner\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_inner\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m4\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embd\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;32mfor\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"n_positions\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"max_sequence_length\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                        \u001b[0;34m\"%s not found in config.json. Falling back to %s (%d)\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                        \u001b[0mbold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"context_window_size\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                        \u001b[0mbold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;32mbreak\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"Unable to determine the maximum sequence length, because none of \"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"`context_window_size`, `n_positions` or `max_sequence_length` is \"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"provided in `config.json`.\"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embd\u001b[0m \u001b[0;34m//\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_head\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32massert\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_head\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embd\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprefill_chunk_size\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"%s defaults to %d\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mbold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"prefill_chunk_size\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mmin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m8192\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprefill_chunk_size\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m8192\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprefill_chunk_size\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mlogger\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"Overriding %s from %d to %d\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mbold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"prefill_chunk_size\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprefill_chunk_size\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mmin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m8192\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprefill_chunk_size\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcontext_window_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m8192\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mFile:\u001b[0m           /media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/model/gpt2/gpt2_model.py\n",
            "\u001b[0;31mType:\u001b[0m           type\n",
            "\u001b[0;31mSubclasses:\u001b[0m     "
          ]
        }
      ],
      "source": [
        "from mlc_llm.model.gpt2.gpt2_model import GPT2Config\n",
        "\n",
        "GPT2Config??"
      ]
    },
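    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The fallback logic in `__post_init__` above can be sketched with a simplified, standalone dataclass (illustrative only; `TinyConfig` is not the actual MLC-LLM class, and it omits the logging and validation):\n",
        "\n",
        "```python\n",
        "import dataclasses\n",
        "from typing import Any, Dict\n",
        "\n",
        "@dataclasses.dataclass\n",
        "class TinyConfig:\n",
        "    n_embd: int\n",
        "    n_head: int\n",
        "    n_inner: int = -1\n",
        "    context_window_size: int = 0\n",
        "    head_dim: int = 0\n",
        "    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)\n",
        "\n",
        "    def __post_init__(self):\n",
        "        # Derive the MLP inner size when it is not given explicitly.\n",
        "        if self.n_inner is None or self.n_inner == -1:\n",
        "            self.n_inner = 4 * self.n_embd\n",
        "        # Fall back to alternative field names found in config.json.\n",
        "        if self.context_window_size == 0:\n",
        "            for name in [\"n_positions\", \"max_sequence_length\"]:\n",
        "                if name in self.kwargs:\n",
        "                    self.context_window_size = self.kwargs.pop(name)\n",
        "                    break\n",
        "        # Derive the per-head dimension from the embedding size.\n",
        "        if self.head_dim == 0:\n",
        "            self.head_dim = self.n_embd // self.n_head\n",
        "\n",
        "cfg = TinyConfig(n_embd=768, n_head=12, kwargs={\"n_positions\": 1024})\n",
        "print(cfg.n_inner, cfg.context_window_size, cfg.head_dim)  # 3072 1024 64\n",
        "```"
      ]
    },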
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "50zWSA0ksN7T"
      },
      "source": [
        "### Defining the Model Architecture in `gpt2_model.py`\n",
        "\n",
        "With {class}`tvm.relax.frontend.nn.Module`, we can define the model architecture in a modular way. It looks very similar to the PyTorch style, except that the forward function does not actually perform computation: it traces the computation graph using the placeholders passed in as inputs.\n",
        "\n",
        "You can optionally use `op._print(some_tensor)` to print a tensor's intermediate value when running the compiled module. If you do, you must pass `debug=True` to `export_tvm()` and `jit()`. Besides manual printing, an [end-to-end debugging module `DebugChat`](#Debug-Compiled-MLC-Model-with-DebugChat) is also provided, which automatically dumps the intermediate values of every layer."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "id": "fh_6l1Ul1sWA",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[0;31mInit signature:\u001b[0m \u001b[0mGPT2Attention\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mmlc_llm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgpt2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgpt2_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGPT2Config\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mDocstring:\u001b[0m     \n",
            "Base class for neural network components. Subclass it to build your models.\n",
            "Modules can nest within each other in a tree structure using regular attribute assignment.\n",
            "\u001b[0;31mSource:\u001b[0m        \n",
            "\u001b[0;32mclass\u001b[0m \u001b[0mGPT2Attention\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mModule\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m  \u001b[0;31m# pylint: disable=too-many-instance-attributes\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0;32mdef\u001b[0m \u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mGPT2Config\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0membed_dim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embd\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_head\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtensor_parallel_shards\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34mf\"\u001b[0m\u001b[0;34mCannot split \u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_head\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m attention heads \u001b[0m\u001b[0;34m\"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34mf\"\u001b[0m\u001b[0;34mevenly to \u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtensor_parallel_shards\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m GPUs.\u001b[0m\u001b[0;34m\"\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_head\u001b[0m \u001b[0;34m//\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtensor_parallel_shards\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscale_attn_by_inverse_layer_idx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscale_attn_by_inverse_layer_idx\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mc_attn\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mLinear\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0min_features\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0membed_dim\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mout_features\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mbias\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mc_proj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mLinear\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0membed_dim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbias\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m    \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhidden_states\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mTensor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpaged_kv_cache\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mPagedKVCache\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlayer_id\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0md\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mh\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead_dim\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mhidden_states\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mqkv\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mc_attn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhidden_states\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mqkv\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqkv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0md\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscale_attn_by_inverse_layer_idx\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mattn_score_scaling_factor\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1.0\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlayer_id\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mattn_score_scaling_factor\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1.0\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;31m# Attention\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0moutput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0mpaged_kv_cache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mattention_with_fused_qkv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0mlayer_id\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mqkv\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnum_heads\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattn_score_scaling_factor\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m(\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mh\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0md\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mc_proj\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mFile:\u001b[0m           /media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/model/gpt2/gpt2_model.py\n",
            "\u001b[0;31mType:\u001b[0m           type\n",
            "\u001b[0;31mSubclasses:\u001b[0m     "
          ]
        }
      ],
      "source": [
        "from mlc_llm.model.gpt2.gpt2_model import GPT2Attention\n",
        "GPT2Attention??"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2prefTaZsN9q"
      },
      "source": [
        "Note that a number of commonly used built-in modules are already provided, and you will find them very handy. For example, the `nn.Linear` and `nn.KVCache` modules used here are both [built-in modules](https://github.com/apache/tvm/blob/unity/python/tvm/relax/frontend/nn/modules.py) in MLC-LLM.\n",
        "\n",
        "Likewise, many common [built-in operators](https://github.com/apache/tvm/blob/unity/python/tvm/relax/frontend/nn/op.py) that operate on tensors are provided, such as `op.reshape`, `op.matmul`, and `op.softmax`."
      ]
    },
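    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check on the shape bookkeeping in `GPT2Attention.forward`: the fused QKV projection maps `(b, s, n_embd)` to `(b, s, 3 * h * d)`, which is reshaped to `(b, s, 3 * h, d)` before attention and merged back to `(b, s, h * d)` afterwards. A plain-Python sketch of that arithmetic (no TVM required; the function name is ours):\n",
        "\n",
        "```python\n",
        "def qkv_shapes(b, s, n_head, head_dim):\n",
        "    \"\"\"Trace tensor shapes through the fused-QKV attention path.\"\"\"\n",
        "    n_embd = n_head * head_dim\n",
        "    after_c_attn = (b, s, 3 * n_head * head_dim)  # fused Q, K, V\n",
        "    after_reshape = (b, s, 3 * n_head, head_dim)  # split into per-head slices\n",
        "    after_attention = (b, s, n_head * head_dim)   # heads merged again\n",
        "    after_c_proj = (b, s, n_embd)                 # back to the embedding size\n",
        "    return after_c_attn, after_reshape, after_attention, after_c_proj\n",
        "\n",
        "print(qkv_shapes(1, 16, 12, 64))\n",
        "# ((1, 16, 2304), (1, 16, 36, 64), (1, 16, 768), (1, 16, 768))\n",
        "```"
      ]
    },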
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Defining the Model Spec with `nn.spec`\n",
        "\n",
        "Once each layer of the model has been verified to behave correctly, we can write the model spec to convert the model from `nn.Module` into a TVM IRModule.\n",
        "\n",
        "In the `get_default_spec` function, we define the model spec as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[0;31mSignature:\u001b[0m \u001b[0mGPT2LMHeadModel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_default_spec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mDocstring:\u001b[0m <no docstring>\n",
            "\u001b[0;31mSource:\u001b[0m   \n",
            "    \u001b[0;32mdef\u001b[0m \u001b[0mget_default_spec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0mmod_spec\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"embed\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_ids\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"seq_len\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"int32\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"prefill\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_embed\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"seq_len\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embed\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobject_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPagedKVCache\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"decode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_embed\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embed\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobject_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPagedKVCache\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"batch_prefill\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_embeds\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"seq_len\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embed\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"logit_positions\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"batch_size\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"int32\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobject_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPagedKVCache\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"batch_decode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_embeds\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"batch_size\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embed\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobject_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPagedKVCache\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"batch_verify\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"input_embeds\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTensor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"seq_len\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_embed\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mObject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobject_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPagedKVCache\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"packed\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m\"create_paged_kv_cache\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"max_batch_size\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"max_total_seq_len\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"prefill_chunk_size\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"page_size\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"support_sliding_window\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m\"$\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"param_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                    \u001b[0;34m\"effect_mode\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"none\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m                \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m            \u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\n",
            "\u001b[0;34m\u001b[0m        \u001b[0;32mreturn\u001b[0m \u001b[0mnn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mspec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mModuleSpec\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfrom_raw\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmod_spec\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
            "\u001b[0;31mFile:\u001b[0m      /media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/model/gpt2/gpt2_model.py\n",
            "\u001b[0;31mType:\u001b[0m      function"
          ]
        }
      ],
      "source": [
        "from mlc_llm.model.gpt2.gpt2_model import GPT2LMHeadModel\n",
        "\n",
        "GPT2LMHeadModel.get_default_spec??"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "All of the specified methods, such as `embed`, `prefill`, and `decode`, will be exported into the TVM IRModule. `nn.spec.Tensor`, `nn.spec.Tuple`, and plain integers are supported as inputs to the Relax functions.\n",
        "\n",
        "The difference between the \"default\" and \"packed\" calling conventions is illustrated below:\n",
        "\n",
        "![](images/diff.png)"
      ]
    },
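    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of the difference, consider how model parameters appear in the exported function signatures. The sketch below is hypothetical plain Python (the real exported functions are Relax functions, and all names here are illustrative):\n",
        "\n",
        "```python\n",
        "# \"default\": every model parameter becomes a separate argument.\n",
        "def prefill_default(input_embed, wte, wpe, ln_f_weight, ln_f_bias):\n",
        "    ...\n",
        "\n",
        "# \"packed\" (param_mode=\"packed\"): all parameters arrive bundled in one\n",
        "# tuple argument, so the signature stays stable as the parameter list grows.\n",
        "def prefill_packed(input_embed, packed_params):\n",
        "    wte, wpe, ln_f_weight, ln_f_bias = packed_params\n",
        "    ...\n",
        "```"
      ]
    },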
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Qi9QF8eRMtJO"
      },
      "source": [
        "Once the model specification is provided, the `export_tvm` function converts the TVM nn.Module into a Relax IRModule. You can then inspect the IR representation of the entire model, along with the complete list of model parameter names, shapes, and data types."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "u0ee5Go6LB0F",
        "outputId": "bb5b9e6f-ff33-43e9-b1da-34503faca171",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "transformer.wte.weight [vocab_size, 768] float32\n",
            "transformer.wpe.weight [1024, 768] float32\n",
            "transformer.h.0.ln_1.weight [768] float32\n",
            "transformer.h.0.ln_1.bias [768] float32\n",
            "transformer.h.0.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.0.attn.c_attn.bias [2304] float32\n",
            "transformer.h.0.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.0.attn.c_proj.bias [768] float32\n",
            "transformer.h.0.ln_2.weight [768] float32\n",
            "transformer.h.0.ln_2.bias [768] float32\n",
            "transformer.h.0.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.0.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.0.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.0.mlp.c_proj.bias [768] float32\n",
            "transformer.h.1.ln_1.weight [768] float32\n",
            "transformer.h.1.ln_1.bias [768] float32\n",
            "transformer.h.1.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.1.attn.c_attn.bias [2304] float32\n",
            "transformer.h.1.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.1.attn.c_proj.bias [768] float32\n",
            "transformer.h.1.ln_2.weight [768] float32\n",
            "transformer.h.1.ln_2.bias [768] float32\n",
            "transformer.h.1.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.1.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.1.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.1.mlp.c_proj.bias [768] float32\n",
            "transformer.h.2.ln_1.weight [768] float32\n",
            "transformer.h.2.ln_1.bias [768] float32\n",
            "transformer.h.2.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.2.attn.c_attn.bias [2304] float32\n",
            "transformer.h.2.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.2.attn.c_proj.bias [768] float32\n",
            "transformer.h.2.ln_2.weight [768] float32\n",
            "transformer.h.2.ln_2.bias [768] float32\n",
            "transformer.h.2.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.2.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.2.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.2.mlp.c_proj.bias [768] float32\n",
            "transformer.h.3.ln_1.weight [768] float32\n",
            "transformer.h.3.ln_1.bias [768] float32\n",
            "transformer.h.3.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.3.attn.c_attn.bias [2304] float32\n",
            "transformer.h.3.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.3.attn.c_proj.bias [768] float32\n",
            "transformer.h.3.ln_2.weight [768] float32\n",
            "transformer.h.3.ln_2.bias [768] float32\n",
            "transformer.h.3.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.3.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.3.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.3.mlp.c_proj.bias [768] float32\n",
            "transformer.h.4.ln_1.weight [768] float32\n",
            "transformer.h.4.ln_1.bias [768] float32\n",
            "transformer.h.4.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.4.attn.c_attn.bias [2304] float32\n",
            "transformer.h.4.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.4.attn.c_proj.bias [768] float32\n",
            "transformer.h.4.ln_2.weight [768] float32\n",
            "transformer.h.4.ln_2.bias [768] float32\n",
            "transformer.h.4.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.4.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.4.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.4.mlp.c_proj.bias [768] float32\n",
            "transformer.h.5.ln_1.weight [768] float32\n",
            "transformer.h.5.ln_1.bias [768] float32\n",
            "transformer.h.5.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.5.attn.c_attn.bias [2304] float32\n",
            "transformer.h.5.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.5.attn.c_proj.bias [768] float32\n",
            "transformer.h.5.ln_2.weight [768] float32\n",
            "transformer.h.5.ln_2.bias [768] float32\n",
            "transformer.h.5.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.5.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.5.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.5.mlp.c_proj.bias [768] float32\n",
            "transformer.h.6.ln_1.weight [768] float32\n",
            "transformer.h.6.ln_1.bias [768] float32\n",
            "transformer.h.6.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.6.attn.c_attn.bias [2304] float32\n",
            "transformer.h.6.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.6.attn.c_proj.bias [768] float32\n",
            "transformer.h.6.ln_2.weight [768] float32\n",
            "transformer.h.6.ln_2.bias [768] float32\n",
            "transformer.h.6.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.6.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.6.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.6.mlp.c_proj.bias [768] float32\n",
            "transformer.h.7.ln_1.weight [768] float32\n",
            "transformer.h.7.ln_1.bias [768] float32\n",
            "transformer.h.7.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.7.attn.c_attn.bias [2304] float32\n",
            "transformer.h.7.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.7.attn.c_proj.bias [768] float32\n",
            "transformer.h.7.ln_2.weight [768] float32\n",
            "transformer.h.7.ln_2.bias [768] float32\n",
            "transformer.h.7.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.7.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.7.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.7.mlp.c_proj.bias [768] float32\n",
            "transformer.h.8.ln_1.weight [768] float32\n",
            "transformer.h.8.ln_1.bias [768] float32\n",
            "transformer.h.8.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.8.attn.c_attn.bias [2304] float32\n",
            "transformer.h.8.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.8.attn.c_proj.bias [768] float32\n",
            "transformer.h.8.ln_2.weight [768] float32\n",
            "transformer.h.8.ln_2.bias [768] float32\n",
            "transformer.h.8.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.8.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.8.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.8.mlp.c_proj.bias [768] float32\n",
            "transformer.h.9.ln_1.weight [768] float32\n",
            "transformer.h.9.ln_1.bias [768] float32\n",
            "transformer.h.9.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.9.attn.c_attn.bias [2304] float32\n",
            "transformer.h.9.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.9.attn.c_proj.bias [768] float32\n",
            "transformer.h.9.ln_2.weight [768] float32\n",
            "transformer.h.9.ln_2.bias [768] float32\n",
            "transformer.h.9.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.9.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.9.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.9.mlp.c_proj.bias [768] float32\n",
            "transformer.h.10.ln_1.weight [768] float32\n",
            "transformer.h.10.ln_1.bias [768] float32\n",
            "transformer.h.10.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.10.attn.c_attn.bias [2304] float32\n",
            "transformer.h.10.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.10.attn.c_proj.bias [768] float32\n",
            "transformer.h.10.ln_2.weight [768] float32\n",
            "transformer.h.10.ln_2.bias [768] float32\n",
            "transformer.h.10.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.10.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.10.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.10.mlp.c_proj.bias [768] float32\n",
            "transformer.h.11.ln_1.weight [768] float32\n",
            "transformer.h.11.ln_1.bias [768] float32\n",
            "transformer.h.11.attn.c_attn.weight [2304, 768] float32\n",
            "transformer.h.11.attn.c_attn.bias [2304] float32\n",
            "transformer.h.11.attn.c_proj.weight [768, 768] float32\n",
            "transformer.h.11.attn.c_proj.bias [768] float32\n",
            "transformer.h.11.ln_2.weight [768] float32\n",
            "transformer.h.11.ln_2.bias [768] float32\n",
            "transformer.h.11.mlp.c_fc.weight [3072, 768] float32\n",
            "transformer.h.11.mlp.c_fc.bias [3072] float32\n",
            "transformer.h.11.mlp.c_proj.weight [768, 3072] float32\n",
            "transformer.h.11.mlp.c_proj.bias [768] float32\n",
            "transformer.ln_f.weight [768] float32\n",
            "transformer.ln_f.bias [768] float32\n",
            "lm_head.weight [vocab_size, 768] float32\n"
          ]
        }
      ],
      "source": [
        "from mlc_llm.model.gpt2 import gpt2_model\n",
        "\n",
        "config_dict = {\n",
        "    \"architectures\": [\"GPT2LMHeadModel\"],\n",
        "    \"bos_token_id\": 50256,\n",
        "    \"eos_token_id\": 50256,\n",
        "    \"hidden_act\": \"gelu_new\",\n",
        "    \"n_ctx\": 1024,\n",
        "    \"n_embd\": 768,\n",
        "    \"n_head\": 12,\n",
        "    \"n_layer\": 12,\n",
        "    \"n_positions\": 1024,\n",
        "    \"layer_norm_epsilon\": 1e-05,\n",
        "    \"scale_attn_by_inverse_layer_idx\": False,\n",
        "    \"vocab_size\": 50257,\n",
        "}\n",
        "\n",
        "config = gpt2_model.GPT2Config.from_dict(config_dict)\n",
        "model = gpt2_model.GPT2LMHeadModel(config)\n",
        "mod, named_params = model.export_tvm(\n",
        "    spec=model.get_default_spec(),\n",
        ")\n",
        "\n",
        "# Uncomment the following line to show the model in Tensor IR\n",
        "# mod.show(black_format=False)\n",
        "\n",
        "for name, param in named_params:\n",
        "    print(name, param.shape, param.dtype)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iR7ijB5WNiML"
      },
      "source": [
        "### Define the loader in `gpt2_loader.py`\n",
        "\n",
        "`gpt2_loader.py` defines how Huggingface parameters are converted into the format used by the MLC model.\n",
        "\n",
        "The loader class returns an [`ExternMapping`](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/loader/mapping.py), which contains two kinds of mappings:\n",
        "- **Source -> MLC parameter mapping**: for example, parameter renaming and parameter transformations.\n",
        "- **Unused mapping**: parameters in the source that are not used by the MLC model definition.\n",
        "\n",
        "In GPT-2, because Conv1D layers are used, the weights of `c_attn`, `c_proj`, and `c_fc` need to be transposed. To do this, provide mapping functions as follows:\n",
        "\n",
        "```python\n",
        "for conv1d_weight_name in [\"attn.c_attn\", \"attn.c_proj\", \"mlp.c_proj\", \"mlp.c_fc\"]:\n",
        "    src_name = f\"h.{i}.{conv1d_weight_name}.weight\"\n",
        "    mlc_name = f\"transformer.{src_name}\"\n",
        "    mapping.add_mapping(\n",
        "        mlc_name,\n",
        "        [src_name],\n",
        "        functools.partial(\n",
        "            lambda x, dtype: x.transpose().astype(dtype),\n",
        "            dtype=named_parameters[mlc_name].dtype,\n",
        "        ),\n",
        "    )\n",
        "```\n",
        "\n",
        "Some additional renaming is also required for the GPT-2 parameter conversion to work correctly; see [gpt2_loader.py](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/model/gpt2/gpt2_loader.py) for details.\n",
        "\n"
      ]
    },
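    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what one of these mapping functions does in isolation, here is a minimal NumPy sketch of the Conv1D weight conversion. NumPy stands in for the loader machinery, and `target_dtype` is a hypothetical stand-in for the dtype looked up from `named_parameters`:\n",
        "\n",
        "```python\n",
        "import functools\n",
        "import numpy as np\n",
        "\n",
        "# Hypothetical stand-in for named_parameters[mlc_name].dtype.\n",
        "target_dtype = \"float16\"\n",
        "\n",
        "# Same shape as the mapping function in gpt2_loader.py: transpose the\n",
        "# Huggingface Conv1D weight from [in, out] into MLC's [out, in] layout,\n",
        "# then cast to the model dtype.\n",
        "convert = functools.partial(\n",
        "    lambda x, dtype: x.transpose().astype(dtype),\n",
        "    dtype=target_dtype,\n",
        ")\n",
        "\n",
        "hf_weight = np.zeros((768, 2304), dtype=\"float32\")  # HF c_attn layout\n",
        "mlc_weight = convert(hf_weight)\n",
        "print(mlc_weight.shape, mlc_weight.dtype)  # (2304, 768) float16\n",
        "```"
      ]
    },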
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Nekb6Ku4V3F2"
      },
      "source": [
        "## Add the model to the supported prebuilt model workflow\n",
        "\n",
        "Once the entire model has been defined with TVM's `nn.Module`, including the model architecture, the model loader, and the model quantizer, it can be added to the supported prebuilt model workflow.\n",
        "\n",
        "In [`mlc-llm/python/mlc_llm/model/model.py`](https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/model/model.py), add the GPT-2 model to the `MODELS` dictionary:\n",
        "\n",
        "```python\n",
        "\"gpt2\": Model(\n",
        "    name=\"gpt2\",\n",
        "    model=gpt2_model.GPT2LMHeadModel,\n",
        "    config=gpt2_model.GPT2Config,\n",
        "    source={\n",
        "        \"huggingface-torch\": gpt2_loader.huggingface,\n",
        "        \"huggingface-safetensor\": gpt2_loader.huggingface,\n",
        "    },\n",
        "    quantize={\n",
        "        \"no-quant\": gpt2_quantization.no_quant,\n",
        "        \"group-quant\": gpt2_quantization.group_quant,\n",
        "    },\n",
        ")\n",
        "```"
      ]
    },
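    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The compilation pipeline then dispatches on the model type (the `\"gpt2\"` key above). A simplified, hypothetical mirror of this registry pattern, with placeholders standing in for the real classes and loader functions, looks like:\n",
        "\n",
        "```python\n",
        "from dataclasses import dataclass, field\n",
        "from typing import Any, Callable, Dict\n",
        "\n",
        "# Simplified, hypothetical mirror of the Model entry in model.py: each\n",
        "# entry bundles the architecture class, its config, weight-loader\n",
        "# variants, and quantization schemes.\n",
        "@dataclass\n",
        "class Model:\n",
        "    name: str\n",
        "    model: Any\n",
        "    config: Any\n",
        "    source: Dict[str, Callable] = field(default_factory=dict)\n",
        "    quantize: Dict[str, Callable] = field(default_factory=dict)\n",
        "\n",
        "MODELS: Dict[str, Model] = {\n",
        "    \"gpt2\": Model(\n",
        "        name=\"gpt2\",\n",
        "        model=object,   # placeholder for gpt2_model.GPT2LMHeadModel\n",
        "        config=object,  # placeholder for gpt2_model.GPT2Config\n",
        "        source={\"huggingface-torch\": lambda: None,\n",
        "                \"huggingface-safetensor\": lambda: None},\n",
        "        quantize={\"no-quant\": lambda: None},\n",
        "    ),\n",
        "}\n",
        "\n",
        "# Tools such as convert_weight look up the entry by model type:\n",
        "entry = MODELS[\"gpt2\"]\n",
        "print(entry.name, sorted(entry.source))\n",
        "```"
      ]
    },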
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AwsQF8zEYh8H"
      },
      "source": [
        "## Compile the GPT-2 model library and weights\n",
        "\n",
        "The following steps are the same as in the [general model compilation workflow](https://llm.mlc.ai/docs/compilation/compile_models.html)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EWLOmEXyZV9q",
        "outputId": "3beb8d58-d067-4eed-c114-27e83a1149cd"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models\n",
            "Updated git hooks.\n",
            "Git LFS initialized.\n",
            "正克隆到 'gpt2'...\n",
            "remote: Enumerating objects: 87, done.\u001b[K\n",
            "remote: Counting objects: 100% (3/3), done.\u001b[K\n",
            "remote: Compressing objects: 100% (2/2), done.\u001b[K\n",
            "remote: Total 87 (delta 0), reused 0 (delta 0), pack-reused 84 (from 1)\u001b[K\n",
            "展开对象中: 100% (87/87), 1.65 MiB | 38.00 KiB/s, 完成.\n",
            "过滤内容: 100% (11/11), 5.23 GiB | 2.64 MiB/s, 完成.\n",
            "/media/pc/data/lxw/ai/tvm-book/tests/.temp\n"
          ]
        }
      ],
      "source": [
        "# Create directory\n",
        "!mkdir -p {temp_dir}/dist/models\n",
        "%cd {temp_dir}/dist/models\n",
        "\n",
        "# Clone HF weights\n",
        "!git lfs install\n",
        "# git clone https://huggingface.co/openai-community/gpt2\n",
        "# git clone git@hf.co:openai-community/gpt2\n",
        "!git clone https://hf-mirror.com/openai-community/gpt2\n",
        "%cd ../.."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bzN9QbUpYeot",
        "outputId": "cb274ff6-0a84-4e78-e0ee-379504fff432",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[2025-01-07 11:11:43] INFO auto_config.py:116: \u001b[92mFound\u001b[0m model configuration: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/config.json\n",
            "[2025-01-07 11:11:46] INFO auto_device.py:79: \u001b[92mFound\u001b[0m device: cuda:0\n",
            "[2025-01-07 11:11:46] INFO auto_device.py:79: \u001b[92mFound\u001b[0m device: cuda:1\n",
            "[2025-01-07 11:11:46] INFO auto_weight.py:71: Finding weights in: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2\n",
            "[2025-01-07 11:11:46] INFO auto_weight.py:130: \u001b[92mFound\u001b[0m source weight format: huggingface-torch. Source configuration: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/pytorch_model.bin\n",
            "[2025-01-07 11:11:49] INFO auto_weight.py:161: \u001b[92mFound\u001b[0m source weight format: huggingface-safetensor. Source configuration: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/model.safetensors.index.json\n",
            "[2025-01-07 11:11:49] INFO auto_weight.py:107: Using source weight configuration: \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/pytorch_model.bin\u001b[0m. Use `--source` to override.\n",
            "[2025-01-07 11:11:49] INFO auto_weight.py:111: Using source weight format: \u001b[1mhuggingface-torch\u001b[0m. Use `--source-format` to override.\n",
            "[2025-01-07 11:11:49] INFO auto_config.py:154: \u001b[92mFound\u001b[0m model type: \u001b[1mgpt2\u001b[0m. Use `--model-type` to override.\n",
            "\u001b[1mWeight conversion with arguments:\u001b[0m\n",
            "  \u001b[1m--config\u001b[0m          /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/config.json\n",
            "  \u001b[1m--quantization\u001b[0m    NoQuantize(name='q0f16', kind='no-quant', model_dtype='float16')\n",
            "  \u001b[1m--model-type\u001b[0m      gpt2\n",
            "  \u001b[1m--device\u001b[0m          cuda:0\n",
            "  \u001b[1m--source\u001b[0m          /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/pytorch_model.bin\n",
            "  \u001b[1m--source-format\u001b[0m   huggingface-torch\n",
            "  \u001b[1m--output\u001b[0m          /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC\n",
            "[2025-01-07 11:11:49] INFO gpt2_model.py:47: \u001b[1mcontext_window_size\u001b[0m not found in config.json. Falling back to \u001b[1mn_positions\u001b[0m (1024)\n",
            "[2025-01-07 11:11:49] INFO gpt2_model.py:64: \u001b[1mprefill_chunk_size\u001b[0m defaults to 1024\n",
            "Start storing to cache /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC\n",
            "[2025-01-07 11:11:52] INFO huggingface_loader.py:185: Loading HF parameters from: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/pytorch_model.bin\n",
            "/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/loader/utils.py:43: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
            "  for name, param in torch.load(path, map_location=torch.device(\"cpu\")).items():\n",
            "[2025-01-07 11:11:53] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mlm_head.weight\u001b[0m\", shape: (50257, 768), dtype: float16\n",
            "[2025-01-07 11:11:53] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.wte.weight\u001b[0m\", shape: (50257, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.wpe.weight\u001b[0m\", shape: (1024, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.0.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.1.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.2.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.3.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:54] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.4.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.5.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.6.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.7.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.8.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.9.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.10.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.ln_1.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.ln_1.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.attn.c_attn.weight\u001b[0m\", shape: (2304, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.attn.c_attn.bias\u001b[0m\", shape: (2304,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.attn.c_proj.weight\u001b[0m\", shape: (768, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.attn.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.ln_2.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.ln_2.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.mlp.c_fc.weight\u001b[0m\", shape: (3072, 768), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.mlp.c_fc.bias\u001b[0m\", shape: (3072,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.mlp.c_proj.weight\u001b[0m\", shape: (768, 3072), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.h.11.mlp.c_proj.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.ln_f.weight\u001b[0m\", shape: (768,), dtype: float16\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:175: [Not quantized] Parameter: \"\u001b[1mtransformer.ln_f.bias\u001b[0m\", shape: (768,), dtype: float16\n",
            "100%|█████████████████████████████████████████| 149/149 [00:02<00:00, 51.89it/s]\n",
            "[2025-01-07 11:11:55] INFO huggingface_loader.py:197: Unloading HF weight file: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/pytorch_model.bin\n",
            "[2025-01-07 11:11:56] INFO stats.py:77: \u001b[92mTime usage\u001b[0m: HF loading: 0.481 sec; Pre-quantization mapping: 0.920 sec; Quantization: 0.000 sec\n",
            "[2025-01-07 11:11:56] INFO stats.py:91: \u001b[92mRAM usage\u001b[0m: Peak RAM: 0.510 GB. Total bytes loaded from disk: 0.510 GB\n",
            "\n",
            "All finished, 8 total shards committed, record saved to /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/ndarray-cache.json\n",
            "[2025-01-07 11:11:56] INFO convert_weight.py:155: \u001b[92mParameter size\u001b[0m after quantization: 0.304 GB\n",
            "[2025-01-07 11:11:56] INFO convert_weight.py:160: \u001b[92mTotal parameters\u001b[0m: 124,439,808\n",
            "[2025-01-07 11:11:56] INFO convert_weight.py:161: \u001b[92mBits per parameter\u001b[0m: 20.963\n",
            "[2025-01-07 11:11:56] INFO convert_weight.py:166: Saved to directory: \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "# Convert weight\n",
        "!python -m mlc_llm convert_weight {temp_dir}/dist/models/gpt2/ --device cuda --quantization q0f16 -o {temp_dir}/dist/gpt2-q0f16-MLC"
      ]
    },
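    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a sanity check on the conversion log above, the reported totals can be reproduced from the GPT-2 small configuration (`n_layer=12`, `n_embd=768`, `n_positions=1024`, `vocab_size=50257`). The snippet below is an illustrative back-of-the-envelope calculation, not part of MLC-LLM:\n",
        "\n",
        "```python\n",
        "V, P, E, L = 50257, 1024, 768, 12\n",
        "per_layer = (2 * 2 * E                # ln_1 / ln_2, weight + bias\n",
        "             + (3 * E) * E + 3 * E    # attn.c_attn (fused q/k/v)\n",
        "             + E * E + E              # attn.c_proj\n",
        "             + (4 * E) * E + 4 * E    # mlp.c_fc\n",
        "             + E * (4 * E) + E)       # mlp.c_proj\n",
        "total = V * E + P * E + L * per_layer + 2 * E  # wte + wpe + blocks + ln_f\n",
        "print(total)  # 124439808, matching \"Total parameters\" in the log\n",
        "stored = total + V * E                         # lm_head.weight duplicates wte on disk\n",
        "print(round(stored * 16 / total, 3))           # 20.963 bits per parameter\n",
        "```\n",
        "\n",
        "The separately stored `lm_head.weight` copy of the tied embedding is why q0f16 reports more than 16 bits per parameter.\n"
      ]
    },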
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "1. gen_config: generate `mlc-chat-config.json` and process the tokenizer files"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QJg-pMTGab2c",
        "outputId": "1c7af629-8611-4f85-840b-3a0f275960a9",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[2025-01-07 11:12:41] INFO auto_config.py:116: \u001b[92mFound\u001b[0m model configuration: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/config.json\n",
            "[2025-01-07 11:12:41] INFO auto_config.py:154: \u001b[92mFound\u001b[0m model type: \u001b[1mgpt2\u001b[0m. Use `--model-type` to override.\n",
            "[2025-01-07 11:12:41] INFO gpt2_model.py:47: \u001b[1mcontext_window_size\u001b[0m not found in config.json. Falling back to \u001b[1mn_positions\u001b[0m (1024)\n",
            "[2025-01-07 11:12:41] INFO gpt2_model.py:64: \u001b[1mprefill_chunk_size\u001b[0m defaults to 1024\n",
            "[2025-01-07 11:12:41] INFO config.py:107: Overriding \u001b[1mmax_batch_size\u001b[0m from 1 to 128\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:150: [generation_config.json] Setting \u001b[1mbos_token_id\u001b[0m: 50256\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:150: [generation_config.json] Setting \u001b[1meos_token_id\u001b[0m: 50256\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:164: \u001b[91mNot found\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/tokenizer.model\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:162: \u001b[92mFound\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/tokenizer.json. Copying to \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/tokenizer.json\u001b[0m\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:162: \u001b[92mFound\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/vocab.json. Copying to \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/vocab.json\u001b[0m\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:162: \u001b[92mFound\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/merges.txt. Copying to \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/merges.txt\u001b[0m\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:164: \u001b[91mNot found\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/added_tokens.json\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:162: \u001b[92mFound\u001b[0m tokenizer config: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/models/gpt2/tokenizer_config.json. Copying to \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/tokenizer_config.json\u001b[0m\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:223: Detected tokenizer info: {'token_postproc_method': 'byte_level', 'prepend_space_in_encode': False, 'strip_space_in_decode': False}\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mpad_token_id\u001b[0m: 0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mtemperature\u001b[0m: 1.0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mpresence_penalty\u001b[0m: 0.0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mfrequency_penalty\u001b[0m: 0.0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mrepetition_penalty\u001b[0m: 1.0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:32: [System default] Setting \u001b[1mtop_p\u001b[0m: 1.0\n",
            "[2025-01-07 11:12:41] INFO gen_config.py:251: Dumping configuration file to: \u001b[1m/media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/mlc-chat-config.json\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "!python -m mlc_llm gen_config {temp_dir}/dist/models/gpt2 --quantization q0f16 --conv-template gpt2 -o {temp_dir}/dist/gpt2-q0f16-MLC/"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "2. Compile: build the model library according to the specification in `mlc-chat-config.json`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[2025-01-07 11:13:02] INFO auto_config.py:70: \u001b[92mFound\u001b[0m model configuration: /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/mlc-chat-config.json\n",
            "[2025-01-07 11:13:05] INFO auto_device.py:79: \u001b[92mFound\u001b[0m device: cuda:0\n",
            "[2025-01-07 11:13:05] INFO auto_device.py:79: \u001b[92mFound\u001b[0m device: cuda:1\n",
            "[2025-01-07 11:13:05] INFO auto_target.py:78: \u001b[92mFound\u001b[0m configuration of target device \"\u001b[1mcuda:0\u001b[0m\": {\"thread_warp_size\": runtime.BoxInt(32), \"arch\": \"sm_86\", \"max_threads_per_block\": runtime.BoxInt(1024), \"max_num_threads\": runtime.BoxInt(1024), \"kind\": \"cuda\", \"max_shared_memory_per_block\": runtime.BoxInt(49152), \"tag\": \"\", \"keys\": [\"cuda\", \"gpu\"]}\n",
            "[2025-01-07 11:13:05] INFO auto_target.py:110: \u001b[92mFound\u001b[0m host LLVM triple: \u001b[1mx86_64-unknown-linux-gnu\u001b[0m\n",
            "[2025-01-07 11:13:05] INFO auto_target.py:111: \u001b[92mFound\u001b[0m host LLVM CPU: \u001b[1mhaswell\u001b[0m\n",
            "[2025-01-07 11:13:05] INFO auto_target.py:334: Generating code for CUDA architecture: \u001b[1msm_86\u001b[0m\n",
            "[2025-01-07 11:13:05] INFO auto_target.py:335: To produce multi-arch fatbin, set environment variable \u001b[1mMLC_MULTI_ARCH\u001b[0m. Example: MLC_MULTI_ARCH=70,72,75,80,86,87,89,90a\n",
            "[2025-01-07 11:13:05] INFO auto_config.py:154: \u001b[92mFound\u001b[0m model type: \u001b[1mgpt2\u001b[0m. Use `--model-type` to override.\n",
            "\u001b[1mCompiling with arguments:\u001b[0m\n",
            "  \u001b[1m--config\u001b[0m          GPT2Config(vocab_size=50257, n_embd=768, n_layer=12, n_head=12, layer_norm_epsilon=1e-05, n_inner=3072, context_window_size=1024, prefill_chunk_size=1024, scale_attn_by_inverse_layer_idx=False, tensor_parallel_shards=1, head_dim=64, max_batch_size=128, kwargs={})\n",
            "  \u001b[1m--quantization\u001b[0m    NoQuantize(name='q0f16', kind='no-quant', model_dtype='float16')\n",
            "  \u001b[1m--model-type\u001b[0m      gpt2\n",
            "  \u001b[1m--target\u001b[0m          {\"thread_warp_size\": runtime.BoxInt(32), \"host\": {\"mtriple\": \"x86_64-unknown-linux-gnu\", \"tag\": \"\", \"kind\": \"llvm\", \"mcpu\": \"haswell\", \"keys\": [\"cpu\"]}, \"arch\": \"sm_86\", \"max_threads_per_block\": runtime.BoxInt(1024), \"libs\": [\"thrust\"], \"max_num_threads\": runtime.BoxInt(1024), \"kind\": \"cuda\", \"max_shared_memory_per_block\": runtime.BoxInt(49152), \"tag\": \"\", \"keys\": [\"cuda\", \"gpu\"]}\n",
            "  \u001b[1m--opt\u001b[0m             flashinfer=0;cublas_gemm=1;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=AUTO\n",
            "  \u001b[1m--system-lib-prefix\u001b[0m \"\"\n",
            "  \u001b[1m--output\u001b[0m          /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/gpt2-q0f16-cuda.so\n",
            "  \u001b[1m--overrides\u001b[0m       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None;pipeline_parallel_stages=None;disaggregation=None\n",
            "[2025-01-07 11:13:05] INFO compile.py:140: Creating model from: GPT2Config(vocab_size=50257, n_embd=768, n_layer=12, n_head=12, layer_norm_epsilon=1e-05, n_inner=3072, context_window_size=1024, prefill_chunk_size=1024, scale_attn_by_inverse_layer_idx=False, tensor_parallel_shards=1, head_dim=64, max_batch_size=128, kwargs={})\n",
            "[2025-01-07 11:13:05] INFO compile.py:158: Exporting the model to TVM Unity compiler\n",
            "[2025-01-07 11:13:07] INFO compile.py:164: Running optimizations using TVM Unity\n",
            "[2025-01-07 11:13:07] INFO compile.py:186: Registering metadata: {'model_type': 'gpt2', 'quantization': 'q0f16', 'context_window_size': 1024, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'pipeline_parallel_stages': 1, 'disaggregation': False, 'kv_state_kind': 'kv_cache', 'max_batch_size': 128}\n",
            "[2025-01-07 11:13:15] INFO pipeline.py:55: Running TVM Relax graph-level optimizations\n",
            "[2025-01-07 11:13:18] INFO pipeline.py:55: Lowering to TVM TIR kernels\n",
            "[2025-01-07 11:13:21] WARNING thrust.py:25: thrust is requested but TVM is not built with thrust.\n",
            "[2025-01-07 11:13:21] WARNING thrust.py:25: thrust is requested but TVM is not built with thrust.\n",
            "[2025-01-07 11:13:24] INFO pipeline.py:55: Running TVM TIR-level optimizations\n",
            "[2025-01-07 11:13:28] INFO pipeline.py:55: Running TVM Dlight low-level optimizations\n",
            "[2025-01-07 11:13:31] INFO pipeline.py:55: Lowering to VM bytecode\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 1.50 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `argsort_probs`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 1.69 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 13.69 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 13.50 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.01 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 1.50 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `multinomial_from_uniform`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 13.50 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `renormalize_by_top_p`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `sample_with_top_p`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_take_probs`: 0.01 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `sampler_verify_draft_tokens`: 0.00 MB\n",
            "[2025-01-07 11:13:33] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB\n",
            "[2025-01-07 11:13:34] INFO pipeline.py:55: Compiling external modules\n",
            "[2025-01-07 11:13:34] INFO pipeline.py:55: Compilation complete! Exporting to disk\n",
            "Traceback (most recent call last):\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/interface/compile.py\", line 189, in _compile\n",
            "    args.build_func(\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/support/auto_target.py\", line 301, in build\n",
            "    relax.build(\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/relax/vm_build.py\", line 353, in build\n",
            "    return _vmlink(\n",
            "           ^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/relax/vm_build.py\", line 249, in _vmlink\n",
            "    lib = tvm.build(\n",
            "          ^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/driver/build_module.py\", line 297, in build\n",
            "    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)\n",
            "                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/_ffi/_ctypes/packed_func.py\", line 245, in __call__\n",
            "    raise_last_ffi_error()\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/_ffi/base.py\", line 481, in raise_last_ffi_error\n",
            "    raise py_err\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/driver/driver_api.cc\", line 531, in operator()\n",
            "    return TIRToRuntime(inputs_arg, host_target);\n",
            "                ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/driver/driver_api.cc\", line 514, in tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)\n",
            "    device_modules.push_back(codegen::Build(device_mod, it.first));\n",
            "              ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/target/codegen.cc\", line 73, in tvm::codegen::Build(tvm::IRModule, tvm::Target)\n",
            "    return (*bf)(mod, target);\n",
            "                    ^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/target/opt/build_cuda_on.cc\", line 161, in tvm::codegen::BuildCUDA(tvm::IRModule, tvm::Target)\n",
            "    ptx = (*f)(code, target).operator std::string();\n",
            "                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/support/auto_target.py\", line 352, in tvm_callback_cuda_compile\n",
            "    ptx = nvcc.compile_cuda(code, target_format=\"fatbin\")\n",
            "          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/contrib/nvcc.py\", line 120, in compile_cuda\n",
            "    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)\n",
            "           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/envs/anaconda3a/envs/ai/lib/python3.12/subprocess.py\", line 1026, in __init__\n",
            "    self._execute_child(args, executable, preexec_fn, close_fds,\n",
            "  File \"/media/pc/data/lxw/envs/anaconda3a/envs/ai/lib/python3.12/subprocess.py\", line 1953, in _execute_child\n",
            "    raise child_exception_type(errno_num, err_msg, err_filename)\n",
            "FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'\n",
            "\n",
            "During handling of the above exception, another exception occurred:\n",
            "\n",
            "Traceback (most recent call last):\n",
            "  File \"<frozen runpy>\", line 198, in _run_module_as_main\n",
            "  File \"<frozen runpy>\", line 88, in _run_code\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/__main__.py\", line 69, in <module>\n",
            "    main()\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/__main__.py\", line 34, in main\n",
            "    cli.main(sys.argv[2:])\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/cli/compile.py\", line 129, in main\n",
            "    compile(\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/interface/compile.py\", line 244, in compile\n",
            "    _compile(args, model_config)\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/interface/compile.py\", line 132, in _compile\n",
            "    with args.target:\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/target/target.py\", line 145, in __exit__\n",
            "    _ffi_api.TargetExitScope(self)\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/_ffi/_ctypes/packed_func.py\", line 245, in __call__\n",
            "    raise_last_ffi_error()\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/_ffi/base.py\", line 481, in raise_last_ffi_error\n",
            "    raise py_err\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/target/target.cc\", line 757, in tvm::Target::ExitWithScope()\n",
            "    ICHECK(!entry->context_stack.empty());\n",
            "                    ^^^^^^^^^^^^^^^^^^^^^^^\n",
            "tvm.error.InternalError: Traceback (most recent call last):\n",
            "  0: tvm::Target::ExitWithScope()\n",
            "        at /media/pc/data/lxw/ai/tvm/src/target/target.cc:757\n",
            "  File \"/media/pc/data/lxw/ai/tvm/src/target/target.cc\", line 758\n",
            "InternalError: Check failed: (entry->context_stack.top().same_as(*this)) is false: \n"
          ]
        }
      ],
      "source": [
        "!python -m mlc_llm compile {temp_dir}/dist/gpt2-q0f16-MLC/mlc-chat-config.json --device cuda -o {temp_dir}/dist/gpt2-q0f16-MLC/gpt2-q0f16-cuda.so"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gq7WE9a7kkF7"
      },
      "source": [
        "(Debug-Compiled-MLC-Model-with-DebugChat)=\n",
        "## Debugging the Compiled MLC Model with DebugChat\n",
        "\n",
        "After successfully compiling the model library and converting the model weights, it is important to check whether the model produces correct output. One way to do so is to compare the model's output logits against those of its Huggingface PyTorch counterpart on the same input tokens.\n",
        "\n",
        "To help debug MLC models, the `mlc_llm.testing.DebugChat` module is provided, which can:\n",
        "\n",
        "- load the MLC model that was just compiled,\n",
        "- run the model's full `forward` flow with a user-specified prompt, and\n",
        "- dump the intermediate values of all layers.\n",
        "\n",
        "You can then compare these intermediate values with those from the Huggingface PyTorch model. (For PyTorch, you can use [`register_forward_hook`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_forward_hook) to extract intermediate values.)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "INCCMvXvtC0O",
        "outputId": "83d84f91-934b-44ab-c81f-c2cdbe07bd62",
        "tags": [
          "hide-output"
        ]
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Traceback (most recent call last):\n",
            "  File \"<frozen runpy>\", line 198, in _run_module_as_main\n",
            "  File \"<frozen runpy>\", line 88, in _run_code\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/testing/debug_chat.py\", line 536, in <module>\n",
            "    main()\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/testing/debug_chat.py\", line 523, in main\n",
            "    dc = DebugChat(\n",
            "         ^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/testing/debug_chat.py\", line 227, in __init__\n",
            "    self.mod, self.params, self.metadata = _get_tvm_module(\n",
            "                                           ^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/mlc-llm/python/mlc_llm/testing/debug_chat.py\", line 49, in _get_tvm_module\n",
            "    ex = tvm.runtime.load_module(lib_path)\n",
            "         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n",
            "  File \"/media/pc/data/lxw/ai/tvm/python/tvm/runtime/module.py\", line 683, in load_module\n",
            "    raise ValueError(f\"cannot find file {path}\")\n",
            "ValueError: cannot find file /media/pc/data/lxw/ai/tvm-book/tests/.temp/dist/gpt2-q0f16-MLC/gpt2-q0f16-cuda.so\n"
          ]
        }
      ],
      "source": [
        "!python -m mlc_llm.testing.debug_chat --model {temp_dir}/dist/gpt2-q0f16-MLC/ --model-lib {temp_dir}/dist/gpt2-q0f16-MLC/gpt2-q0f16-cuda.so --device cuda --debug-dir {temp_dir}/debug-gpt2 --generate-len 5 \"Hey how are you doing today?\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The intermediate outputs are dumped to the `debug-gpt2` folder. Each prefill/decode stage gets its own subfolder, which contains `.npz` files storing the arguments of every kernel function call.\n",
        "\n",
        "For example, `./debug-gpt2/decode_2/f0_take3.npz` corresponds to the 0th `take` function call in the 2nd decode step. The output logits are saved to `logits.npz`.\n",
        "\n",
        "**Note**: since TIR function calls follow the [destination-passing style](https://mlc.ai/chapter_end_to_end/index.html#call-dps-packed-construct), the arguments of each function call look like this:\n",
        "\n",
        "```python\n",
        "def low_level_prim_func(in0, in1, ..., out):\n",
        "    # implementation\n",
        "```\n",
        "\n",
        "Hence, the last argument of a function call is its output.\n",
        "\n",
        "The `.npz` files can be loaded as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "E6ZFQEJIW2Gc",
        "outputId": "b4a010fc-7530-472f-a45b-fced474b622e"
      },
      "outputs": [
        {
          "ename": "FileNotFoundError",
          "evalue": "[Errno 2] No such file or directory: '/media/pc/data/lxw/ai/tvm-book/tests/.temp/debug-gpt2/decode_2/f0_take3.npz'",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
            "Cell \u001b[0;32mIn[21], line 3\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m data \u001b[38;5;241m=\u001b[39m \u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mload\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43mf\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;132;43;01m{\u001b[39;49;00m\u001b[43mtemp_dir\u001b[49m\u001b[38;5;132;43;01m}\u001b[39;49;00m\u001b[38;5;124;43m/debug-gpt2/decode_2/f0_take3.npz\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m      4\u001b[0m \u001b[38;5;28mprint\u001b[39m(data)\n\u001b[1;32m      5\u001b[0m \u001b[38;5;28mprint\u001b[39m(data[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124marg_0\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n",
            "File \u001b[0;32m/media/pc/data/lxw/envs/anaconda3a/envs/ai/lib/python3.12/site-packages/numpy/lib/npyio.py:427\u001b[0m, in \u001b[0;36mload\u001b[0;34m(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)\u001b[0m\n\u001b[1;32m    425\u001b[0m     own_fid \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[1;32m    426\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m--> 427\u001b[0m     fid \u001b[38;5;241m=\u001b[39m stack\u001b[38;5;241m.\u001b[39menter_context(\u001b[38;5;28;43mopen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mos_fspath\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfile\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mrb\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m)\n\u001b[1;32m    428\u001b[0m     own_fid \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m    430\u001b[0m \u001b[38;5;66;03m# Code to distinguish from NumPy binary files and pickles.\u001b[39;00m\n",
            "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/media/pc/data/lxw/ai/tvm-book/tests/.temp/debug-gpt2/decode_2/f0_take3.npz'"
          ]
        }
      ],
      "source": [
        "import numpy as np\n",
        "\n",
        "data = np.load(f'{temp_dir}/debug-gpt2/decode_2/f0_take3.npz')\n",
        "print(data)\n",
        "print(data[\"arg_0\"])\n",
        "print(data[\"arg_1\"])\n",
        "print(data[\"arg_2\"])  # In destination-passing style, the last argument is the output of `take`"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "ai",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
